NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

NOIR: Privacy-Preserving Generation of Code with Open-Source LLMs

Nguyen, Khoa; Ton, That_Khiem; Phan, NhatHai; Khalil, Issa; Tran, Khang; Borcea, Cristian; Jin, Ruoming; Khreishah, Abdallah; Thai, My (August 2026, the 35th Usenix Security Symposium (Usenix Security 2026))

Full Text Available
PAC-Bayes Bounds for Multivariate Linear Regression and Linear Autoencoders

Guo, Ruixin; Jin, Ruoming; Li, Xinyu; Zhou, Yang (December 2025, 39th Conference on Neural Information Processing Systems (NeurIPS 2025))

Linear Autoencoders (LAEs) have shown strong performance in state-of-the-art recommender systems. However, this success remains largely empirical, with limited theoretical understanding. In this paper, we investigate the generalizability – a theoretical measure of model performance in statistical learning – of multivariate linear regression and LAEs. We first propose a PAC-Bayes bound for multivariate linear regression, extending the earlier bound for single-output linear regression by Shalaeva et al. [45], and establish sufficient conditions for its convergence. We then show that LAEs, when evaluated under a relaxed mean squared error, can be interpreted as constrained multivariate linear regression models on bounded data, to which our bound adapts. Furthermore, we develop theoretical methods to improve the computational efficiency of optimizing the LAE bound, enabling its practical evaluation on large models and real-world datasets. Experimental results demonstrate that our bound is tight and correlates well with practical ranking metrics such as Recall@K and NDCG@K.
more » « less
Full Text Available
Flexible, Efficient, and Stable Adversarial Attacks on Machine Unlearning

Zhou, Zihan; Zhou, Yang; Zhang, Zijie; Lyu, Lingjuan; Yan, Da; Jin, Ruoming; Dou, Dejing (July 2025, Proceedings of the 42nd International Conference on Machine Learning (ICML'25))

Machine unlearning (MU) aims to remove the influence of specific data points from trained models, enhancing compliance with privacy regulations. However, the vulnerability of basic MU models to malicious unlearning requests in adversarial learning environments has been largely overlooked. Existing adversarial MU attacks suffer from three key limitations: inflexibility due to pre-defined attack targets, inefficiency in handling multiple attack requests, and instability caused by non-convex loss functions. To address these challenges, we propose a Flexible, Efficient, and Stable Attack (DDPA). First, leveraging Carathéodory's theorem, we introduce a convex polyhedral approximation to identify points in the loss landscape where convexity approximately holds, ensuring stable attack performance. Second, inspired by simplex theory and John's theorem, we develop a regular simplex detection technique that maximizes coverage over the parameter space, improving attack flexibility and efficiency. We theoretically derive the proportion of the effective parameter space occupied by the constructed simplex. We evaluate the attack success rate of our DDPA method on real datasets against state-of-the-art machine unlearning attack methods. Our source code is available at https://github.com/zzz0134/DDPA.
more » « less
Full Text Available
Flexible, Efficient, and Stable Adversarial Attacks on Machine Unlearning

Zhou, Zihan; Zhou, Yang; Zhang, Zijie; Lyu, Lingjuan; Yan, Da; Jin, Ruoming; Dou, Dejing (July 2025, Proceedings of the 42nd International Conference on Machine Learning (ICML'25))

Machine unlearning (MU) aims to remove the influence of specific data points from trained models, enhancing compliance with privacy regulations. However, the vulnerability of basic MU models to malicious unlearning requests in adversarial learning environments has been largely overlooked. Existing adversarial MU attacks suffer from three key limitations: inflexibility due to pre-defined attack targets, inefficiency in handling multiple attack requests, and instability caused by non-convex loss functions. To address these challenges, we propose a Flexible, Efficient, and Stable Attack (DDPA). First, leveraging Carathéodory's theorem, we introduce a convex polyhedral approximation to identify points in the loss landscape where convexity approximately holds, ensuring stable attack performance. Second, inspired by simplex theory and John's theorem, we develop a regular simplex detection technique that maximizes coverage over the parameter space, improving attack flexibility and efficiency. We theoretically derive the proportion of the effective parameter space occupied by the constructed simplex. We evaluate the attack success rate of our DDPA method on real datasets against state-of-the-art machine unlearning attack methods. Our source code is available at https://github.com/zzz0134/DDPA.
more » « less
Full Text Available
Federated Contrastive Learning of Graph-Level Representations

Li, Xiang; Agrawal, Gagan; Ramnath, Rajiv; Jin, Ruoming (December 2024, IEEE)

Full Text Available
Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models

Liu, Ji; Ren, Jiaxiang; Jin, Ruoming; Zhang, Zijie; Zhou, Yang; Valduriez, Patrick; Dou, Dejing (November 2024, Proceedings of the 29th Conference on Empirical Methods in Natural Language Processing (EMNLP'24))

As a promising paradigm to collaboratively train models with decentralized data, Federated Learning (FL) can be exploited to fine-tune Large Language Models (LLMs). While LLMs correspond to huge size, the scale of the training data significantly increases, which leads to tremendous amounts of computation and communication costs. The training data is generally non-Independent and Identically Distributed (non-IID), which requires adaptive data processing within each device. Although Low-Rank Adaptation (LoRA) can significantly reduce the scale of parameters to update in the fine-tuning process, it still takes unaffordable time to transfer the low-rank parameters of all the layers in LLMs. In this paper, we propose a Fisher Information-based Efficient Curriculum Federated Learning framework (FibecFed) with two novel methods, i.e., adaptive federated curriculum learning and efficient sparse parameter update. First, we propose a fisher information-based method to adaptively sample data within each device to improve the effectiveness of the FL fine-tuning process. Second, we dynamically select the proper layers for global aggregation and sparse parameters for local update with LoRA so as to improve the efficiency of the FL fine-tuning process. Extensive experimental results based on 10 datasets demonstrate that FibecFed yields excellent performance (up to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61% faster) compared with 17 baseline approaches).
more » « less
Full Text Available
Scalable Deep Metric Learning on Attributed Graphs

Li, Xiang; Agrawal, Gagan; Jin, Ruoming; Ramnath, Rajiv (July 2024, LNCS)

We consider the problem of constructing embeddings of large attributed graphs and supporting multiple downstream learning tasks. We develop a graph embedding method, which is based on extending deep metric and unbiased contrastive learning techniques to 1) work with attributed graphs, 2) enabling a mini-batch based approach, and 3) achieving scalability. Based on a multi-class tuplet loss function, we present two algorithms -- DMT for semi-supervised learning and DMAT-i for the unsupervised case. Analyzing our methods, we provide a generalization bound for the downstream node classification task and for the first time relate tuplet loss to contrastive learning. Through extensive experiments, we show high scalability of representation construction, and in applying the method for three downstream tasks (node clustering, node classification, and link prediction) better consistency over any single existing method.
more » « less
Full Text Available
User Dynamics and Thematic Exploration in r/Depression During the COVID-19 Pandemic: Insights From Overlapping r/SuicideWatch Users

https://doi.org/10.2196/53968

Zhu, Jianfeng; Jin, Ruoming; Kenne, Deric R; Phan, NhatHai; Ku, Wei-Shinn (May 2024, Journal of Medical Internet Research)

BackgroundIn 2023, the United States experienced its highest- recorded number of suicides, exceeding 50,000 deaths. In the realm of psychiatric disorders, major depressive disorder stands out as the most common issue, affecting 15% to 17% of the population and carrying a notable suicide risk of approximately 15%. However, not everyone with depression has suicidal thoughts. While “suicidal depression” is not a clinical diagnosis, it may be observed in daily life, emphasizing the need for awareness. ObjectiveThis study aims to examine the dynamics, emotional tones, and topics discussed in posts within the r/Depression subreddit, with a specific focus on users who had also engaged in the r/SuicideWatch community. The objective was to use natural language processing techniques and models to better understand the complexities of depression among users with potential suicide ideation, with the goal of improving intervention and prevention strategies for suicide. MethodsArchived posts were extracted from the r/Depression and r/SuicideWatch Reddit communities in English spanning from 2019 to 2022, resulting in a final data set of over 150,000 posts contributed by approximately 25,000 unique overlapping users. A broad and comprehensive mix of methods was conducted on these posts, including trend and survival analysis, to explore the dynamic of users in the 2 subreddits. The BERT family of models extracted features from data for sentiment and thematic analysis. ResultsOn August 16, 2020, the post count in r/SuicideWatch surpassed that of r/Depression. The transition from r/Depression to r/SuicideWatch in 2020 was the shortest, lasting only 26 days. Sadness emerged as the most prevalent emotion among overlapping users in the r/Depression community. In addition, physical activity changes, negative self-view, and suicidal thoughts were identified as the most common depression symptoms, all showing strong positive correlations with the emotion tone of disappointment. Furthermore, the topic “struggles with depression and motivation in school and work” (12%) emerged as the most discussed topic aside from suicidal thoughts, categorizing users based on their inclination toward suicide ideation. ConclusionsOur study underscores the effectiveness of using natural language processing techniques to explore language markers and patterns associated with mental health challenges in online communities like r/Depression and r/SuicideWatch. These insights offer novel perspectives distinct from previous research. In the future, there will be potential for further refinement and optimization of machine classifications using these techniques, which could lead to more effective intervention and prevention strategies.
more » « less
Full Text Available
On Item-Sampling Evaluation for Recommender System

https://doi.org/10.1145/3629171

Li, Dong; Jin, Ruoming; Liu, Zhenming; Ren, Bin; Gao, Jing; Liu, Zhi (March 2024, ACM Transactions on Recommender Systems)

Personalized recommender systems play a crucial role in modern society, especially in e-commerce, news, and ads areas. Correctly evaluating and comparing candidate recommendation models is as essential as constructing ones. The common offline evaluation strategy is holding out some user-interacted items from training data and evaluating the performance of recommendation models based on how many items they can retrieve. Specifically, for any hold-out item or so-called target item for a user, the recommendation models try to predict the probability that the user would interact with the item and rank it among overall items, which is calledglobal evaluation. Intuitively, a good recommendation model would assign high probabilities to such hold-out/target items. Based on the specific ranks, some metrics likeRecall@KandNDCG@Kcan be calculated to further quantify the quality of the recommender model. Instead of ranking the target items among all items, Koren first proposed to rank them among a smallsampled set of items, then quantified the performance of the models, which is calledsampling evaluation. Ever since then, there has been a large amount of work adopting sampling evaluation due to its efficiency and frugality. In recent work, Rendle and Krichene argued that the sampling evaluation is “inconsistent” with respect to a global evaluation in terms of offline top-Kmetrics. In this work, we first investigate the “inconsistent” phenomenon by taking a glance at the connections between sampling evaluation and global evaluation. We reveal the approximately linear relationship between sampling with respect to its global counterpart in terms of the top-KRecall metric. Second, we propose a new statistical perspective of the sampling evaluation—to estimate the global rank distribution of the entire population. After the estimated rank distribution is obtained, the approximation of the global metric can be further derived. Third, we extend the work of Krichene and Rendle, directly optimizing the error with ground truth, providing not only a comprehensive empirical study but also a rigorous theoretical understanding of the proposed metric estimators. To address the “blind spot” issue, where accurately estimating metrics for small top-Kvalues in sampling evaluation is challenging, we propose a novel adaptive sampling method that generalizes the expectation-maximization algorithm to this setting. Last but not least, we also study the user sampling evaluation effect. This series of works outlines a clear roadmap for sampling evaluation and establishes a foundational theoretical framework. Extensive empirical studies validate the reliability of the sampling methods presented.
more » « less
Full Text Available
AEDFL: Efficient Asynchronous Decentralized Federated Learning with Heterogeneous

Liu, Ji; Che, Tianshi; Zhou, Yang; Jin, Ruoming; Dai, Huaiyu; Dou, Dejing; Valduriez, Patrick (April 2024, Proceedings of the 24th SlAM International Conference on Data Mining (SDM'24))

Full Text Available

« Prev Next »

Search for: All records